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Abstract 

We consider the problem of learning a forest of nonlinear decision rules with general loss functions. 
The standard methods employ boosted decision trees such as Adaboost for exponential loss and Fried- 
man's gradient boosting for general loss. In contrast to these traditional boosting algorithms that treat a 
tree learner as a black box, the method we propose directly learns decision forests via fully-corrective 
regularized greedy search using the underlying forest structure. Our method achieves higher accuracy 
and smaller models than gradient boosting (and Adaboost with exponential loss) on many datasets. 



1 Introduction 



Many application problems in machine learning require learning nonlinear functions from data. A popular 



method to solve this problem is through decision tree learning (such as CART [Breiman et al. 1984] and 
C4.5 [Quinlan 1993}), which has an important advantage for handling heterogeneous data with ease when 
different features come from different sources. This makes decision trees a popular "black box" machine 
learning method that can be readily applied to any data without much tuning; in comparison, alternative 
algorithms such as neural networks require significantly more tuning. However, a disadvantage of decision 
tree learning is that it does not generally achieve the most accurate prediction performance, when compared 
to other methods. A remedy for this problem is through boosting [ Freund and Schapire| 1997| Friedman, 
200 1| Schapire| |2003 1, where one builds an additive model of decision trees by sequentially building trees 
one by one. In general "boosted decision trees" is regarded as the most effective black-box nonlinear learn- 
ing method for a wide range of application problems. 

In the boosted tree approach, one considers an additive model over multiple decision trees, and thus, 
we will refer to the resulting function as a decision forest. Other approach to learning decision forests 
include bagging and random forests [Breiman) 1 1 996| |200T| . In this context, we may view boosted decision 
tree algorithms as methods to learn decision forests by applying a greedy algorithm (boosting) on top of 
a decision tree base learner. This indirect approach is sometimes referred to as a wrapper approach (in 
this case, wrapping boosting procedure over decision tree base learner); the boosting wrapper simply treats 
the decision tree base learner as a black box and it does not take advantage of the tree structure itself. The 
advantage of such a wrapper approach is that the underlying base learner can be changed to other procedures 
with the same wrapper; the disadvantage is that for any specific base learner which may have additional 
structure to explore, a generic wrapper might not be the optimal aggregator. 

Due to the practical importance of boosted decision trees in applications, it is natural to ask whether 
one can design a more direct procedure that specifically learns decision forests without using a black-box 
decision tree learner under the wrapper. The purpose of doing so is that by directly taking advantage of 
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the underlying tree structure, we shall be able to design a more effective algorithm for learning the final 
nonlinear decision forest. This paper attempts to address this issue, where we propose a direct decision 
forest learning algorithm called Regularized Greedy Forest or RGF. We are specifically interested in an 
approach that can handle general loss functions (while, for example, Adaboost is specific to a certain loss 
function), which leads to a wider range of applicability. An existing method with this property is gradient 
boosting decision tree (GBDT) [Friedman} 2001 1. We show that RGF can deliver better results than GBDT 
on many datasets. 



2 Problem Setup 

We consider the problem of learning a single nonlinear function /i(x) on some input vector x = [x[l] , . . . , x[d]] G 
U. d from a set of training examples. In supervised learning, we are given a set of input vectors X = 
[xi, . . . , x n ] with labels Y = [yi, . . . , y m ] (here m may not equal to n). Our training goal is to find a 
nonlinear prediction function /i(x) from a function class T~L that minimizes a risk function 

h = &rgmm£(h(X),Y). (1) 
heH 

Here H is a pre-defined nonlinear function class, h(X) = [/i(xi), . . . , /i(x n )] is a vector of size n, and 
C(h, •) is a general loss function of vector h € W 1 . 

The loss function £(•,•) is given by the underlying problem. For example, for regression problems, we 
have yi E R and m = n. If we are interested in the conditional mean of y given x, then the underlying loss 
function corresponds to least squares regression as follows: 



c(h(x),Y) = Y / (H^)-y l ) 2 - 



In binary classification, we assume that y^ £ {±1} and m = n. We may consider the logistic regression 
loss function as follows: 

n 

£{h{X),Y) = ln (! + e~ h ^ Vi ). 

i=l 

Another important problem that has drawn much attention in recent years is the pair- wise preference learning 



(for example, see [Her brich et al.| [2000} |Freund et al.||2003| ]), where the goal is to learn a nonlinear function 



/i(x) so that h(x) > /i(x') when x is preferred over x'. In this case, m = n(n — 1), and the labels 
encode pair- wise preference as yuy) = 1 when Xj is preferred over Xj/, and yu^n = otherwise. For this 
problem, we may consider the following loss function that suffers a loss when /i(x) < /i(x') + 1. That is, 
the formulation encourages the separation of /i(x) and /i(x') by a margin when x is preferred over x': 

C(h(X), Y)= J2 max (0, 1 - (h(*i) - h{^))) 2 . 

( i ' i '):J/(i, I ')=l 

Given data (X, Y) and a general loss function £(•, •) in ([!]), there are two basic questions to address 
for nonlinear learning. The first is the form of nonlinear function class T~L, and the second is the learn- 
ing/optimization algorithm. This paper achieves nonlinearity by using additive models of the form: 



H 



A 



h(-) : h(x) = ^ajgjix); Vj,gj € C > , (2) 
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where each ay G E is a coefficient that can be optimized, and each <?j(x) is by itself a nonlinear function 
(which we may refer to as a nonlinear basis function or an atom) taken from a base function class C. The 
base function class typically has a simple form that can be used in the underlying algorithm. This work 
considers decision rules as the underlying base function class that is of the form 



C 



(3) 



where { (ij , tj ) , [ik , tk ) } are a set of (feature-index, threshold) pair, and T(x) denotes the indicator function: 
T(p) = 1 if p is true; otherwise. 

Since the space of decision rules is rather large, for computational purposes, we have to employ a 
structured search over the set of decision rules. The optimization procedure we propose is a structured 
greedy search algorithm which we call regularized greedy forest (RGF). To introduce RGF, we first discuss 



pros and cons of the existing method for general loss, gradient boosting |Friedman 2001 1, in the next 
section. 



3 Gradient Boosted Decision Tree 

Gradient boosting is a method to minimize ([TJ) with additive model ([2]) by assuming that there exists a 
nonlinear base learner (or oracle) A that satisfies Assumption [T] 

Assumption 1. A base learner for a nonlinear function class A is a regression optimization method that 
takes as input any pair X = [xi, . . . , x n ] and Y = [yi, . . . , y n ] and outputs a nonlinear function g = 
A(X, Y) that approximately solves the regression problem: 

n 

g « argmmmm^(/3 • g(kj) - yj) 2 . 
J=l 

The gradient boosting method is a wrapper (boosting) algorithm that solves ([I]) with a base learner A 
defined above and additive model defined in Q. The general algorithm is described in Algorithm [T] Of 
special interest for this paper and for general applications is the decision tree base learner, for which C is 
the class of J-leaf decision trees, with each node associated with a decision rule of the form Q. In order 
to take advantage of the fact that each element in C contains J (rather than one) decision rules, Algorifhm[T] 
can be modified by adding a partially corrective update step that optimizes all J coefficients associated with 
the J decision rules returned by A. This adaption was suggested by Friedman and implemented in our 
experimental comparisons. We shall refer to this modification as gradient boosted decision tree (GBDT), 
and the details are listed in Algorithm [5] 

Gradient boosting may be regarded as a functional generalization of gradient descent method 
_1 — s fc " \ h=h k _ 1 > where the shrinkage parameter s corresponds to the step size Sk in gradient de- 
scent, and — 9C qI^ \h=h k -! i s approximated using the regression tree output. The shrinkage parameter s > 
is a tuning parameter that can affect performance, as noticed by Friedman. In fact, the convergence of the 



algorithm generally requires choosing — > as indicated in the theoretical analysis of [Zhang and Yu 
2005 ] , which is also natural when we consider that it is analogous to step size in gradient descent. This is 
consistent with Friedman's own observation, who argued that in order to achieve good prediction perfor- 
mance (rather than computational efficiency), one should take as small a step size as possible (preferably 
infinitesimal step size each time), and the resulting procedure is often referred to as e-boosting. 
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Algorithm 1: Generic Gradient Boosting [Friedman, 2001] 



h (x)<r- arg min p C(p, Y) 
for k = 1 to K do 

Y k ^-d£(h,Y)/dh\ h=hk _ l(x) 
g k ^A(X,Y k ) 

ft^argmin /3eK £(/i fc _i(X) + /3 ■ g k (X),Y) 

h k (~x)<— hk-i{x) + sf3 k g k (~x) II s is a shrinkage parameter 

end 

return h(x) = ftjf (x) 



GBDT constructs a decision forest which is an additive model of K decision trees. The method has been 
very successful for many application problems, and its main advantage is that the method can automatically 
find nonlinear interactions via decision tree learning (which can easily deal with heterogeneous data), and 
it has relatively few tuning parameters for a nonlinear learning scheme (the main tuning parameters are the 
shrinkage parameter s, number of terminals per tree J, and the number of trees K). However, it has a 
number of disadvantages as well. First, there is no explicit regularization in the algorithm, and in fact, it 
is argued in [Zhang and Yu 2005 1 that the shrinkage parameter s plus early stopping (that is K) interact 
together as a form of regularization. In addition, the number of nodes J can also be regarded as a form 
of regularization. The interaction of these parameters in term of regularization is unclear, and the resulting 
implicit regularization may not be effective. The second issue is also a consequence of using small step size 
s as implicit regularization. Use of small s can lead to huge number of trees, which is very undesirable as 
it leads to high computational cost of applications (i.e., making predictions). Note that in order to achieve 
good performance, it is often necessary to choose a small shrinkage parameter s and hence large K; in the 
extremely scenario of e-boosting, one needs an infinite number of trees. Third, the regression tree learner is 
treated as a black box, and its only purpose is to return J nonlinear terminal decision rule basis functions. 
This again may not be effective because the procedure separates tree learning and forest learning, and hence 
the algorithm itself is not necessarily the most effective method to construct the decision forest. 



4 Fully- Corrective Greedy Update and Structured Sparsity Regularization 

As mentioned above, one disadvantage of gradient boosting is that in order to achieve good performance in 
practice, the shrinkage parameter s often needs to be small, and Friedman himself argued for infinitesimal 
step size. This practical observation is supported by the theoretical analysis in [Zhang a nd Yu[[2005[ which 
showed that if we vary the shrinkage s for each iteration k as s k , then for general loss functions with 
appropriate regularity conditions, the procedure converges as k — > oo if we choose the sequence s k such 
that s k\flk\ = °o and Yl k s k fik < 00 ■ This condition is analogous to a related condition for the step 
size of gradient descent method which also requires the step-size to approach zero. Fully Corrective Greedy 
Algorithm is a modification of Gradient Boosting that can avoid the potential small step size problem. The 
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procedure is described in Algorithm [2] 



Algorithm 2: Fully-Corrective Gradient Boosting [Shalev-Shwartz et aL 2010] 



h (x)<-&rg min p £(p,F) 
for k = 1 to K do 

Y k ^-dC(h,Y)/dh\ h=hk _ l(x) 
g k ^A{X, Y k ) 

letW fc = {Ej=ite(x):/3i€R} 
/i fc (x)^argmin /ieWfc Y) 
end 

return /i(x) = ftR-(x) 



// fully-coiTective step 



In gradient boosting of Algorithm[T](or its variation with tree base learner of Algorithm [5]), the algorithm 
only does a partial corrective step that optimizes either the coefficient of the last basis function g k (or the 
last J coefficients). The main difference of the fully-corrective gradient boosting is the fully-corrective- 
step that optimizes all coefficients {/3j}^ =1 for basis functions {gj}j =1 obtained so far at each iteration 
k. It was noticed empirically that such fully-corrective step can significantly accelerate the convergence of 
boosting procedures [War muth et al.| 2006[ . This observation was theoretically justified in [Shalev-Shwartz 



et al. 2010| where the following rate of convergence was obtained under suitable conditions: there exists a 
constant Cq such that 



C{h k {X),Y) < mf 



C(h(X),Y) + 



k 



where Cq is a constant that depends on properties of £(•, •) and the function class %, and 



inf 



^2\oij\ : h(X) = ^ a j9j(X); g i € C 



In comparison, with only partial corrective optimization as in the original gradient boosting, no such conver- 
gence rate is possible. Therefore the fully-corrective step is not only intuitively sensible, but also important 
theoretically. The use of fully-corrective update (combined with regularization) automatically removes the 
need for using the undesirable small step s needed in the traditional gradient boosting approach. 

However, such an aggressive greedy procedure will lead to quick overfitting of the data if not appropri- 
ately regularized (in gradient boosting, an implicit regularization effect is achieved by small step size s, as 



argued in [Zhang and Yu 2005 1). Therefore we are forced to impose an explicit regularization to prevent 
overfitting. 

This leads to the second idea in our approach, which is to impose explicit regularization via the concept 
of structured sparsity that has drawn much attention in recent years JBaraniuk et al. 2010 Jenatton et al. 



2009] |Jacob et al.| [20091 |Bachl [2008] [2009] |Huang et alT] [20TT| . The general idea of structured sparsity 
is that in a situation where a sparse solution is assumed, one can take advantage of the sparsity structure 
underlying the task. In our setting, we seek a sparse combination of decision rules (i.e., a compact model), 
and we have the forest structure to explore, which can be viewed as graph sparsity structures. Moreover, the 
problem can be considered as a variable selection problem. Search over all nonlinear interactions (atoms) 
over C is computationally difficult or infeasible; one has to impose structured search over atoms. The idea of 
structured sparsity is that by exploring the fact that not all sparsity patterns are equally likely, one can select 
appropriate variables (corresponding to decision rules in our setting) more effectively by preferring certain 
sparsity patterns more than others. For our purpose, one may impose structured regularization and search to 
prefer one sparsity pattern over another, exploring the underlying forest structure. 
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Algorithmically, one can explore the use of an underlying graph structure that connects different vari- 
ables in order to search over sparsity patterns. The general concept of graph-structured sparsity were con- 
sidered in [Bach, 2008 2009 Huang et al. 2011 1. A simple example is presented in Figure[TJ where each 
node of the graph indicates a variable (non-linear decision rule), and each gray node denotes a selected 
variable. The graph structure is used to select variables that form a connected region; that is, we may grow 
the region by following the edges from the variables that have been selected already, and new variables 
are selected one by one. Figure [T] indicates the order of selection. This approach reduces both statistical 
complexity and algorithmic complexity. The algorithmic advantage is quite obvious; statistically, using 
the information theoretical argument in [Huang et al. 201 1| , one can show that the generalization perfor- 
mance is characterized by 0(J2j e { se i ecte d nodes} 1°§2 degree(parent(j))), while without structure, it will be 



0(#{selected nodes} • lnp), where p is the total number of atoms in C. Based on this general idea, |Bach 
2009 ] considered the problem of learning with nonlinear kernels induced by an underlying graph. 



o — o — o — o — o — o — o — O 



— — — — — — — 




o — o — o — o — o 



Figure 1 : Graph Structured Sparsity 



This work considers the special but important case of learning a forest of nonlinear decision rules; al- 
though this may be considered as a special case of the general structured sparsity learning with an underlying 
graph, the problem itself is rich and important enough and hence requires a dedicated investigation. Specif- 
ically, we integrate this framework with specific tree-structured regularization and structured greedy search 
to obtain an effective algorithm that can outperform the popular and important gradient boosting method. 
In the context of nonlinear learning with graph structured sparsity, we note that a variant of boosting was 



proposed in preund and Mason} |1999[ , where the idea is to split trees not only at the leaf nodes, but also at 
the internal nodes at every step. However, the method is prone to overfitting due to the lack of regulariza- 
tion, and is computationally expensive due to the multiple splitting of internal nodes. We shall avoid such a 
strategy in this work. 



5 Regularized Greedy Forest 

The method we propose addresses the issues of the standard method GBDT described above by directly 
learning a decision forest via fully-corrective regularized greedy search. The key ideas discussed in Section 
[4] can be summarized as follows. 

First, we introduce an explicit regularization functional on the nonlinear function h and optimize 

h = axgmm[£(h(X),Y)+K(h)] (4) 
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instead of ([T]). In particular, we define regularizes that explicitly take advantage of individual tree structures. 

Second, we employ fully-corrective greedy algorithm which repeatedly re-optimizes the coefficients of 
all the decision rules obtained so far while rules are added into the forest by greedy search. Although such an 
aggressive greedy procedure could lead to quick overfitting if not appropriately regularized, our formulation 
includes explicit regularization to avoid overfitting and the problem of huge models caused by small s. 

Third, we perform structured greedy search directly over forest nodes based on the forest structure (graph 
sparsity structure) employing the concept of structured sparsity. At the conceptual level, our nonlinear 
function /i(x) is explicitly defined as an additive model on forest nodes (rather than trees) consistent with 
the underlying forest structure. In this framework, it is also possible to build a forest by growing multiple 
trees simultaneously. 

Before going into more detail, we shall introduce some definitions and notation that allow us to formally 
define the underlying formulations and procedures. 

5.1 Definitions and notation 

A forest is an ensemble of multiple decision trees Ti, . . . , Tr-. The forest shown in Figure [2] contains three 
trees Ti, T2, and T3. Each tree edge e is associated with a variable k e and threshold t e , and denotes a 
decision of the form Z(x.[k e ] < t e ) or Z(x[fc e ] > t e ). Each node denotes a nonlinear decision rule of the 
form Q, which is the product of decisions along the edges leading from the root to this node. 




Figure 2: Decision Forest 



Mathematically, each node v of the forest is associated with a decision rule of the form 

<? v (x) = \{Z(K[i 3 ] < t^Hl^ik] >U k ), 

j k 

which serves as a basis function or atom for the additive model considered in this paper. Note that if v\ and 
v 2 are the two children of v, then g v (x) = g vi (x) + g V2 (x) . This means that any internal node is redundant in 
the sense that an additive model with basis functions g v (x) , g Vl (x) , g V2 (x) can be represented as an additive 
model over basis functions g Vl (x) and g v , 2 (x). Therefore it can be shown that an additive model over all tree 
nodes always has an equivalent model (equivalent in terms of output) over leaf nodes only. This property 
is important for computational efficiency because it implies that we only have to consider additive models 
over leaf nodes. 

Let F represent a forest, and each node v of F is associated with (g v , a v ,6 v ). Here g v is the basis 
function that this node represents; a v is the weight or coefficient assigned to this node; and 9 V represents 
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other attributes of this node such as depth. The additive model of this forest F considered in this paper 
is: /ij-(x) = X^tieJ 7 a v9v{*) with a v = for any internal node v. We rewrite the regularized loss in (|4j) 
as follows to emphasize that the regularizer depends on the underlying forest structure, by replacing the 
regularization term IZ(hjr) with Q(F): 

Q{F)=L(h F {X),Y) + g{F). (5) 



5.2 Algorithmic framework 

The training objective of RGF is to build a forest that minimizes Q(F) defined in ((5). Since the exact 
optimum solution is difficult to find, we greedily select the basis functions and optimize the weights. At 
a high level, we may summarize RGF in a generic algorithm in Algorithm [3] It essentially has two main 
components as follows. 

• Fix the weights, and change the structure of the forest (which changes basis functions) so that the loss 
Q(F) is reduced the most (Line[2]-Q. 

• Fix the structure of the forest, and change the weights so that loss Q(F) is minimized (Line|5]). 



Algorithm 3: Regularized greedy forest framework 



F^{}. 
repeat 

d<— argmin og0 (jr) Q{o{F)) where O(F) is a set of all the structure-changing operations 
applicable to F. 

3 if (Q(o(J 7 )) > Q{F)) then break // Leave the loop if 6 does not reduce the loss. 

4 F<—d(F). II Perform the optimum operation. 

5 if some criterion is met then optimize the leaf weights in F to minimize loss Q{F). 
until some exit criterion is met; 

Optimize the leaf weights in F to minimize loss Q(F). 
return /ij-(x) 



5.3 Specific Implementation 

There may be more than one way to instantiate useful algorithms based on Algorithm|3] Below, we describe 
what we found effective and efficient. 

5.3.1 Search for the optimum structure change (Line [2]) 

For computational efficiency, we only allow the following two types of operations in the search strategy: 

• to split an existing leaf node, 

• to start a new tree (i.e., add a new stump to the forest). 
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The operations include assigning weights to new leaf nodes and setting zero to the node that was split. 
Search is done with the weights of all the existing leaf nodes fixed, by repeatedly evaluating the maximum 
loss reduction of all the possible structure changes. When it is prohibitively expensive to search the entire 
forest (and that is often the case with practical applications), we limit the search to the most recently-created 
t trees with the default choice of t = 1. This is the strategy in our current implementation. For example, 
Figure [3] shows that at the same stage as Figure [2] we may either consider splitting one of the leaf nodes 
marked with symbol X or grow a new tree T4 (split T4S root). 




Figure 3: Decision Forest Splitting Strategy (we may either split a leaf in T3 or start a new tree T4) 

Note that RGF does not require the tree size parameter needed in GBDT With RGF, the size of each 
tree is automatically determined as a result of minimizing the regularized loss. 

Computation Consider the evaluation of loss reduction by splitting a node associated with (g, a, 9) into 
the nodes associated with (g Ul ,a + S%, 6 Ul ) and (g U2 , a + 82 , U2 ) . Then the model associated with the new 
forest JF = o{F) after splitting the node can be written as: 

2 2 
hp{%) = /ij-(x) - a ■ flr(x) + ^2(a + <5 fc )5u fc (x) = /ij-(x) + • 9u k (x)- (6) 

k=l k=l 

Recall that our additive models are over leaf nodes only. The node that was split is no longer leaf and there- 
fore a ■ g(x) is removed from the model. The second equality is from g(x) = g Ul (x) + <7« 2 (x) due to the 
parent-child relationship. To emphasize that 5\ and 82 are the only variables in JF for the current computa- 
tion, let us write T(8i, 62) for the new forest. Our immediate goal here is to find arg min^ ^ 2 Q(JF(8\, 82))- 
Actual computation depends on Q{JF). In general, there may not be an analytical solution for this 
optimization problem, whereas we need to find the solution in an inexpensive manner as this computation 
is repeated frequently. For fast computation, one may employ gradient-descent approximation as used in 
gradient boosting. However, the sub-problem we are looking at is simpler, and thus instead of the simpler 
gradient descent approximation, we perform one Newton step which is more accurate; namely, we obtain 
the approximately optimum <5& (k = 1, 2) as: 



In particular, suppose that loss function is for either regression or classification tasks, then Q(F) can be 
written in the following form 



Q(F) = Y J l{hr(K i ),y i )/n + g{T) . 



i=l 

In this case, dh Q^ = g Uk (x), 9 ^52 *^ = 0, and hj?(x) = h-p, )(x) by (6); thus 5k in (7 1 can be rewritten 
as: 

~ Z^g Uh ( Xi )=l dh \h=hr(xi),y=yi n d8 k l<5i=0,<5 2 =0 

k ~ d 2 £(h,y) 1 7 d 2 g{T{S 1 ,S 2 )) \ ' 

2^gu k {xi)=i dh 2 \h=hjr{^),y= yi + n \s 1= o,5 2 =a 

For example, with square loss £(h, y) = (h — y) 2 /2 and L2 regularization penalty Q{F) = A Ylv&J 7 a v/2> 
we have 

^ fc (x l )=l 1 + nA 

which is the exact optimum for the given split. 
5.3.2 Weight optimization/correction (Line [5) 

With the basis functions fixed, the weights can be optimized using a standard procedure if the regularization 
penalty is standard (e.g., L\- or ^-penalty). In our implementation we perform coordinate descent, which 
iteratively goes through the basis functions and in each iteration updates the weights by a Newton step with 
a small step size: 

a v ^a v + y ^- 2 Q{T{Sv)h , (8) 



where S v is the additive change to a v . Computation of the Newton step is similar to Section 5.3.1 

Since the initial weights of new leaf nodes set in Line [4] are approximately optimum at the moment, it is 
not necessary to perform weight correction in every iteration, which is relatively expensive. We found that 
the strategy of "correcting the weights once every few new leaf nodes are added" works well. The interval 
between fully-corrective updates is not crucial as long as it is not too large. 

5.4 Tree-structured regularization 

Explicit regularization is a crucial component of this framework. To simplify notation, we define regularizes 
over a single tree. The regularizer over a forest can be obtained by adding the regularizes described here 
over all the trees. Therefore, suppose that we are given a tree T with an additive model over leaf nodes: 

h T (x) = ^2 a v9v(x) , » v = for v L T 

«er 

where Lt denotes the set of leaf nodes in T. 

To consider useful regularizers, first recall that for any additive model over leaf nodes only, there always 
exist equivalent models over all the nodes of the same tree that produce the same output. More precisely, let 
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Figure 4: Example of Equivalent Models 



A(v) denote the set of ancestor nodes of v and v itself, and let T((3) be a tree that has the same topological 
structure as T but whose node weights {a v } are replaced by {f3 v }. Then we have 

Vu <E L T : Pv = a u 44> h T ^)(x) = h T (x) 

veA(u) 

as illustrated in Figure [4j Our basic idea is that it is natural to give the same regularization penalty to 
all equivalent models defined on the same tree topology. One way to define a regularizer that satisfies 
this condition is to choose a model of some desirable properties as the unique representation for all the 
equivalent models and define the regularization penalty based on this unique representation. This is the 
high-level strategy we take. That is, we consider the following form of regularization: 

Q{T) = ^r{p v ,6 v ) : ftr(/>)(x) = M x ) ■ 

Here node v includes both internal and leaf nodes; the additive model ^t T (p)( x ) serves as the unique repre- 
sentation of the set of equivalent models; and r(-, •) is a penalty function of v's weight and attributes. Each 
p v is a function of given leaf weights {a u } u& L T , though the function may not be a closed form. Since reg- 
ularizes in this form utilize the entire tree including its topological structure, we call them tree-structured 
regularize™. Below, we describe three tree-structured regularizes using three distinct unique representa- 
tions. 

5.4.1 L2 regularization on leaf-only models 

The first regularizer we introduce simply chooses the given leaf-only model as the unique representation 
(namely, p v = a v ) and sets r(p v , 6 V ) = based on the standard L2 regularization. This leads to 

where A is a constant for controlling the strength of regularization. A desirable property of this unique 
representation is that among the equivalent models, the leaf-only model is often (but not alway^j) the one 
with the smallest number of basis functions, i.e., the most sparse. 

5.4.2 Minimum-penalty regularization 

Another approach we consider is to choose the model that minimizes some penalty as the unique representa- 
tive of all the equivalent models, as it is the most preferable model according to the defined penalty. We call 

1 For example, consider a leaf-only model on a stump whose two sibling leaf nodes have the same weight a/0. Its equivalent 
model with the fewest basis functions (with nonzero coefficients) is the one whose weight is a on the root and zero on the two leaf 
nodes. 
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this type of regularizer a min-penalty regularizer. In the following min-penalty regularizer, the complexity 
of a basis function is explicitly regularized via the node depth. 



G(T) = A • min ^ - 7 *# : ^(x) = ^(x) }> . (9) 

Here d„ is the depth of node v, which is the distance from the root, and 7 is a constant. A larger 7 > 1 
penalizes deeper nodes more severely, which are associated with more complex decision rules, and we 
assume that 7 > 1. 

Computation To derive an algorithm for computing this regularizer, first we introduce auxiliary variables 
{Pv}veT, recursively defined as: 

Pot — Pot 1 Pv — Pv ~\~ Pp(y) 1 

where or is T's root, and p(v) is u's parent node, so that we have 

hr(P) = hr <&Vv G L T . [j3 v = a v ] , (10) 

and ([9]) can be rewritten as: 

g(T) = A • m_in{/({/34) : G L T .[ p v = a v }} (11) 

{fiv} 

where /({&}) = £ 7 d » (~Pv - P p(v) ) 2 /2 + p 2 OT /2 . (12) 

Vy^OT 



Setting /'s partial derivatives to zero, we obtain that at the optimum, 



v ^ Ot 



i.e., essentially, /3„ is the weighted average of the neighbors. This naturally leads to an iterative algorithm 
summarized in Algorithm |4| Convergence of this algorithm and some more computational detail of this 
regularizer are shown in the Appendix. 

Algorithm 4: 



for.GTdo^j Q 



for i = 1 to m do 

for v G Lt do P v ,i^a v 



for v ^ Lt do p v 



end 

return {P v>m } 



_i+2 7 v + °T 
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Figure 5: Example of Sum-to-zero Sibling Model 



5.4.3 Min-penalty regularization with sum-to-zero sibling constraints 

Another regularizer we introduce is based on the same basic idea as above but is computationally simpler. 
We add to (|9]> the constraint that the sum of weights for every sibling pair must be zero, 

G(T) = A • mm \ : /i T(/3) (x) = h T (x); \/v £ L T . 

[ver 

as illustrated in Figure [5] The intuition behind this sum-to-zero sibling constraints is that less redundant 
models are preferable and that the models are the least redundant when branches at every internal node lead 
to completely opposite actions, namely, 'adding x to' versus 'subtracting x from' the output value. 

Using the auxiliary variables {/3 V } as defined above, it is straightforward to show that any set of equiv- 
alent models has exactly one model that satisfies the sum-to-zero sibling constraints. This model, whose 
coefficients are {3 V }, can be obtained through the following recursive computation: 

8 ~ i ° v - VeLT (14) 

More computational details of this regularizer is described in the Appendix. 



A» = o 

p(w)=v 




6 Experiments 

This section reports empirical studies of RGF in comparison with GBDT and some other tree ensemble 
methods including AdaBoost. For simplicity, we focus on square loss for regression and binary classification 
tasks, although some results with exponential loss and multi-class categorization results are also shown. 
All experimental results in this paper can be reproduced using the RGF software available from |http : | 

// rie johnson . com/ rgf_download . html| 



6.1 Parameter settings and some computational detail 
6.1.1 Regularized greedy forest 



RGF-L2 (RGF with the regularizer in Section 5.4.1 1 was tested with the regularization parameter A set to 
one of {1,0.1,0.01}, with a few exce ptions in expone ntial loss experiments (described later). RGF with 

and 



min-penalty regularization in Sections |5.4.2 
In all the configurations, the search for the 



best 



5.4.3 
node 



was tested with 7 e {2,4} and A G {L^, 
split was limited to the most recently-created tree, 



and weight optimization was done after every 100 leaf nodes. Weight optimization was done by coordinate 
descent with step size set to 0.5. The number of iterations was 10 for square loss and 5 for exponential loss. 
On the regression task, /ij-(x) was fitted to {(xj, yi — y)}i, where y = ^27=1 W n > an d the final output was 
set to /ij-(x) + y. 
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6.1.2 Gradient Boosted Decision Tree 



Gradient boosted decision tree (GBDT) in Algorithm]?} as proposed in [Friedman) 2001] , is an adaption of 
the generic gradient boosting method in Algorithm [T] This is the algorithm implemented in this work for 
comparison. 



Algorithm 5: Gradient Boosted Decision Tree (GBDT) | Friedman 2001 [ 



h (x)<- arg min p C(p, Y) 
for k = 1 to K do 

Y k ^-dC(h,Y)/dh\ h=hk _ l{x) 

Build a J-leaf decision tree Tk<—A(X, Y k ) with leaf-nodes {gk,j}j=i 
for j = 1 to J do /3 fcii ^argmin ( g eR £(/i fc _ 1 (X) + /3 • g kJ {X),Y) 
%(x)-«— /tjt-i(x) + s J2j=i PkJ ' 5fc,j( x ) // s is a shrinkage parameter 
end 

return /i(x) = Tijf (x) 

With square loss, j coincides with the leaf weight computed by the tree builder, which does not need to 



be recomputed. With exponential loss, approximation by the Newton step (3k 



E 9fc .(x)=i Vi ex p(- h k-i(xi)yi) 

1 ' E 9fej . (x) =i exp(-h fe _ 1 (x i )3/ i ) 



was performed. Following [FriedmanJ 2001 1, the nodes were split in the best-first manner. For tree size J, six 
values {2, 4, 8, 12, 16, 64} were tested, and for the shrinkage parameter s, seven values {1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001} 
were tested, which makes 42 configurations for each loss type. 



6.1.3 Common settings 

Whenever applicable, node split was prohibited if it would cause a leaf node to have training data points 
fewer than 10, as is commonly done. In the computation of the optimum 5k ^ for exponential loss, the 

E 9u ( X )=l Vie-y^F^+V-nXa-eV 

following was done for numerical stability: 5h< ^* _„ h tz \tt. — ; where 

fx<r- max(— 500, min(500, ^ YJi=x 2/i^( x i))) • Similar protection was done in the weight correction pro- 
cedure and for GBDT with exponential loss. 



6.2 Comparison with GBDT 

Due to our interest in sparse models, we plot performance against the number of leaf nodes, which is equal 
to the number of basis functions of the model. As the number of tested configurations for GBDT is too 
large to fit in one graph, only a few best-performing configurations (measured by the peak performances) 
are shown for GBDT. In the legends, the first number following "gb" is the tree size and the second number 
is the shrinkage parameter, e.g., "gb8:s=0.1" means GBDT with 8-leaf trees with shrinkage parameter 0.1. 



6.2.1 On the synthesized datasets controlling complexity of target functions 

First we study the behavior of the methods in relation to the complexity of target functions using synthesized 
datasets. To synthesize datasets, first we defined the target function by randomly generating 100 of g-leaf 
regression trees; then we randomly generated data points and applied the target function to them to assign 
the output/target values. The dimensionality of data points was 10. Note that a larger tree size q makes the 
target function more complex. The results shown in Figures^ and u\ are square error (y — h) 2 averaged over 



14 




64-leaf synth 

rgf:X=l 

rgf:X=0.1 

rg&=0.01 

gb8:s=0.1 

gbl2:s=0.1 

gb4:s=0.1 



0.35 



5000 10000 
#leaf 




16-leaf synth 

rgf:lam=0. 1 

rgf:lam=0.01 

gb4:s=0.1 

gb8:s=0.1 

gbl2:s=0.1 



5000 10000 
#leaf 



Figure 6: Performance Comparison on Synthetic Data (regression with complex targets) 
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Figure 7: Performance Comparison on Synthetic Data (regression with simple targets) 



the 10 datasets generated with different random seeds. In each run, 2K data points were used for training 
and the number of test data points was 20K. 

The first results (Figure [6]> are on the datasets whose target functions were defined by 64-leaf trees and 
16-leaf trees, which are relatively complex. RGF-L2 (blue solid lines) achieves smaller error than GBDT 
(black dotted lines) with all the tested values of regularization parameter A. The next results in Figure [7] are 
on the datasets whose target functions are defined by 4- and 2-leaf trees (stumps), which are relatively simple. 
RGF-L2 (blue solid lines) achieves smaller error than GBDT when A is appropriate, but the merit somewhat 
diminishes on the stump datasets. RGF-L2 seems to be particularly effective for relatively complex target 
functions, on which regularization can help to avoid overfitting. 

On the datasets with simpler targets in Figure [7] RGF with the min-penalty regularization (green lines 
with long dashes) is shown to be effective. The configurations shown here set 7 = 4 or 2 and A = O.OI/7, 
which encourages simpler decision rules, and outperform not only GBDT but also RGF-L2. The two min- 
penalty regularizes (with and without the sibling constraint) achieved similar performance on these datasets 
and only the results with the sibling constraints are shown in the figure. Similar results were obtained on 
the synthesized 2-way classification dataset^] (Figure [8J. All the synthesized datasets are provided with the 
RGF software. 



6.2.2 Regression and 2-way classification tasks on the real-world datasets 



On the real-world datasets, we report the regression and binary classification results with square loss (Figure 
[9]) and binary classification results with exponential loss (Figure 10). The tested datasets and task s are 
summarized in Table[l] All except House^ are from the UCI repository |Frank and Asuncion 2010 1. All 

2 To synthesize datasets for the 2-way classification task, after performing the same procedure for regression data generation, 
we changed the target values to +1 or —1 (corresponding to the positive/negative class) based on whether the original target value 
is above or below the median. 

3[ 



3 |http: 



//lib . stat . emu . edu 
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Figure 8: : Performance Comparison on Synthetic Data (classification with complex targets (up) and simple 
targets (down)) 
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Regression or binary tasks 


Houses 


6 


Target:log(median house price) 




CT slices 


384 


Target: relative location of 


Adult 


14(168) 


Is income > $50K? 








CT slices 


Letter 


16 


A-M vs N-Z 




Musk2 


166 


Musk or not 


Nursery 


8(24) 


"Special priority" or not 




Waveform2 


40 


Class2 vs. Classl&3 



Table 1: Real-world Datasets. We report the average of 3 runs, each of which uses 2K training data points. The 
numbers in parentheses indicate the dimensionality after converting categorical attributes to indicator vectors. 



the results are the average of 3 runs, each of which used randomly-drawn 2K training data points. For 
multi-class data, binary tasks were generated as follows; A-M vs. N-Z on Letter; special priority or not on 



Nursery; and Class2 vs. others on Waveform2. On House, as suggested by the creators of the data [Pace 



and Ronge| 1997 1, we predict the logarithm of the median house price of the region based on the median 



income, the median house age, total rooms/population, total bedrooms/population, population/households, 
and households. The original task associated with Musk2 is multi-instance learning. We use this dataset 
for single-instance learning by regarding all the instances independently. The official test sets were used as 
test sets if any (Letter and Adult). For relatively large Nursery and Houses, 5K data points were held out 
as test sets. For relatively small Musk2 and Waveform2, in each run, 2K data points were randomly chosen 
as training sets, and the rest were used as test sets. On CT slices, the instances with even patient IDs were 
used as test data. Categorical attributes, contained in Adult and Nursery, were converted to binary indicator 
vectors. The exact partitions of training and test data after data conversion described here are provided with 
the RGF software. 

In Figure [9j we show regression and binary classification results with square loss on the real-world 
datasets. RGF-L2 (blue solid lines) shows merit over GBDT in terms of higher accuracy or sparser models 
on all but the Adult dataset. The min-penalty regularization with 7 > 1 was found to be effective on Musk2 
but not on the other datasets. Our conjecture based on the synthesized data experiments is that the unknown 
target functions underlying these real-world datasets are relatively complex except for Musk2. On Adult 
and Houses, RGF has limited success in terms of its peak accuracy. On these datasets performances appear 
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Figure 9: Performance Comparison on Real Data (square loss, regression and 2-way classification). Average 
of 3 runs. Each run used 2K training data points. 



to converge with a smaller number of basis functions (leaf nodes) than other datasets. The indication is 
that RGF is particularly useful when a relatively large number of basis functions is required to model the 
underlying target functions; this is one of the situations when RGF's fully-corrective update and explicit 
regularization can make a real difference. 

Figure 10 shows the results of RGF and GBDT with exponential loss ("xrgf" and "xgb"). The square loss 
results ("rgf " and "gb") and AdaBoost performances ("ada") are also shown for comparison. The base learn- 
ers for AdaBoost were regression trees generated in the best-first manner with tree size in {2, 4, 8, 12, 16, 64} 
as in the GBDT experiments. For AdaBoost, the best-performing configuration among these six configura- 
tions is shown. It appears that exponential loss generally requires weaker regularization than square loss. 
On some datasets the effective range of regularization parameters are smaller; e.g., A = 0.1 with square loss 
vs. A = le — 10 with exponential loss on Letter. Similarly, larger shrinkage parameters such as s = 1 or 
s = 0.5 worked well with GBDT with exponential loss. Consequently, GBDT's lack of explicit regulariza- 
tion is less harmful with exponential loss, and on some datasets GBDT and RGF (and AdaBoost) achieved 
similar performance. However, RGF with either square loss or exponential loss generally produced more 
accurate/sparse models than GBDT or AdaBoost. On some datasets, RGF with exponential loss performs 
better than RGF with square loss while it does not on other datasets. There may be other regularization 
methods that are more suitable to exponential loss, but we did not explore this direction in this paper. 



6.2.3 Multi-class categorization tasks of larger scale 

We report the multi-class categorization experiments of larger scale using two datasets: Letter (26 classes; 
16K training data points; 4K test data points) and MNIST^](10 classes; 60K training data points; 10K test 
data points). We also tested discrete AdaBoost with trees as base learners. As AdaBoost is closely related 
to exponential loss, we tested RGF-L2 and GBDT with exponential loss as well as square loss. There are 
multi-class training methods for both GBDT and AdaBoost, and RGF can be extended similarly. But in this 

4 |http : / /yann . lecun . com/exdb/mnist/| 
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Figure 10: Performance Comparison on Synthetic Data and Real Data (exponential loss, 2-way classifica- 
tion). Average of 10 (synthetic) or 3 (real) runs. Each run used 2K ttaining data points. 



work, with all the methods, we simply trained m models for m classes using the 'one-vs-all' scheme while 
weighting loss for compensating for skewed class distributions: 2 \s + \ Sies+ ^ + 2\s I Sies_ where li 
is the loss on the i-th data point, and S + and S- represent the sets of positive and negative training data 
points, respectively. 

The parameters (tree size for AdaBoost; loss function, tree size, and shrinkage parameter s for GBDT; 
loss function and A for RGF-L2) were chosen on the training sets by 2-fold cross validation on Letter and one 
run of 4: 1 -split on MNIST. As it was computationally challenging to run the 84 configurations (42 times two 
loss functions) for GBDT, we focused on 30 most promising configurations selected using smaller portions. 
Similarly, we focused on four RGF configurations (square loss with A G {0.1, 0.01} and exponential loss 
with A E {le — 10, le — 20}). Note that we tested more configurations for GBDT and AdaBoost than for 
RGF to explore the potential of these alternatives the best we can. The settings chosen based on the training 
sets were: RGF: exponential loss and A = le — 20 (MNIST); square loss and A = 0.1 (Letter); GBDT: 
exponential loss and s = 1 for both and tree-size 16 (MNIST) and 64 (Letter); tree-size 16 for AdaBoost on 
both. 

With these settings, training was done until the number of leaf nodes (per class) reached 30K. This 
number was chosen since on the training data split the best-performing configurations of each tested method 
appeared to mostly converge before that. On the test data, it turned out that this observation also held, 
and there was no clear sign of overfilling in this range of model sizes. In Table |2| we report the average 
performance of the last several models. That is, the numbers in the table are error rates (%) averaged over 
the models whose sizes are between #leaf=25K and 30K. 





RGF 


GBDT 


AdaBoost 


MNIST 


1.40 (0.02) 


1.62 (0.02) 


1.46 (0.02) 


Letter 


2.24 (0.01) 


2.58 (0.03) 


2.45 (0.03) 



Table 2: Error Rates (%) on Large Scale Multi-Class Data. Averaged over the last 10 models whose sizes are 
between 25K and 30K (per class) in terms of #leaf. The numbers in parentheses are standard deviation. 
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Figure 1 1 : Comparison on MNIST. 



On both datasets, RGF-L2 achieved lower error than the others. On Letter, our 2.24% exceeds 2.35% 
described as state of the art [Kegl and Busa-Fekete| 2009]. On MNIST, the difference between RGF and 
AdaBoost is small but as seen in Figure [TT] they are clearly separated. Also, RGF shows merit of smaller 
models; at #leaf=15K, RGF's error rate already reaches 1.37% whereas AdaBoost is still 1.52% and it takes 
22K leaf nodes for AdaBoost to come down to 1.46%. We note that our AdaBoost performance is better 
than the best figure reported in the literature [Kegl and Busa-Fekete 2009] (1.53%). We consider this as 
a testimony that we have thoroughly explored the potential of AdaBoost in our experiments. RGF's 1.4% 
exceeds a number of neural networks in |LeCun et al.[ [!998[, and it is as good as SVM with 4-degree 
polynomial kernel [Burges and Scholkopf , 1997], though it falls behind the convolutional nets LeNet-4 and 
-5 (<1.1%) | |LeCun et al. 1998[ . Considering that RGF is a generic nonlinear learner that is not even 
designed for image classification applications, its performance is surprisingly competitive. 



Search for the best performance of GBDT with square loss During parameter selection, we noticed that 
with model sizes of #leaf=30K or fewer, GBDT with square loss (GBDT-square) often achieves performance 
inferior to the others, although the performance may improve with models of larger sizes. To find out how 
well GBDT-square can potentially perform, we increased the model size and directly searched for the best 
performance of GBDT-square on the official training/test splits. This is a computationally expensive process 
especially on MNIST (about 300 hours of computation on MNIST) which cannot be easily repeated. The 
best error rates of GBDT-square we found by using this exhaustive search procedure are 3.08% on Letter, 
obtained with s = 0.1 and 4-leaf trees at #leaf=164.5K; and 2.16% on MNIST, obtained with s = 0.1 and 
4-leaf trees at #leaf=83K. These are still worse than the error rates in Table [2] (achieved for GBDT with 
exponential loss) and the model sizes are three to five times larger. By contrast, the error rate of RGF-L2 
with square loss (on average over the models of #leaf=25K-30K) is: 2.24% on Letter, which is the lowest 
among the tested methods as shown in Table [2j and 1.54% with A = 0.01 on MNIST, which exceeds 
GBDT with both square and exponential losses. The results are consistent with the the regression and 2-way 
classification experiments in the previous sections, which we view as a further support for the proposed 
RGF's approach. 



6.2.4 GBDT with post processing of fully-corrective updates 



A two-stage approach was proposed in [Friedman and Popescu 2003] that first performs GBDT to learn 
basis functions and then fits their weights with L\ penalty in the post-processing stage. Note that by contrast 
RGF generates basis functions and optimizes their weights in an interleaving manner so that fully-corrected 
weights can influence generation of the next basis functions. Figure [12] shows representative results of their 



two-stage approach on the regression and 2-way classification tasks described in Sections 6.2.1 and 6.2.2 



As is well known, L\ regularization has "feature selection" effects, assigning zero weights to more and more 
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Figure 12: GBDT with L\ regularized post-processing. 



features with stronger regularization. After performing GBDT until the maximum number of leaf nodes in 



the graphs (e.g., 10000 on Letter) was obtained, we used the R package glmnet [Friedman et al. 2011 1 
to compute the entire L\ path in which the regularization parameter goes down gradually and thus more 
and more basis functions obtain nonzero weights. The solid lines plot the performance of the entire L\ 
path in relation to the number of nonzero weights. The dotted lines are GBDT without post-processing for 
comparison. In both, GBDT was performed with the best-performing parameters shown in Figures [6] and 
[9] The graphs show that the L\ post processing makes the models sparser so that the performance peak is 
obtained with fewer basis functions. But it does not necessarily improve the accuracy of the models; rather, 
accuracy is degraded in some cases. We view that the results support RGF's interleaving approach. 

6.3 Comparison with other tree ensemble methods 

In the previous section, we studied RGF's performance in comparison with GBDT. This section reports 
empirical comparison with other types of tree ensemble methods on the regression and 2-way classification 
tasks described in Sections 6.2.1 and 6.2.2 As in Figures|6 - 10 all the results are the average of either 10 or 



3 runs (10 on the synthesized data and 3 on the others), each of which used 2K training data points. 
6.3.1 Random forests 

Random forests generate K trees from K random inputs generated by random draw of training samples 



and features. We used the R package randomForest | Brieman et al.| [2010| and performed random for- 



est training with the number of randomly-drawn features k in {1, |, |, ^ default}, where d is the feature 
dimensionality and the 'default' is the value recommended by the system. For each of these five configura- 



tions, the number of trees was varied from 1 to 10000. In Figure 13 we show the results of the best (at the 
peak) random forest configuration among the five configurations in black lines. Since we are interested in 
compact models, we plot performance in relation to the number of leaf nodes of the forests (in the log-scale). 
The blue lines are RGF-L2 with square loss. On most of the datasets, RGF achieves higher accuracy (or 
lower error) with much fewer leaf nodes, compared with the best-performing random forests configuration. 

6.3.2 Bayesian Additive Regression Trees (BART) 



Bayesian Additive Regression Trees (BART) [Chipm an et al.| 2010 1 is a Bayesian approach to tree ensemble 



learning. We used the R package BayesTree [Chipman and mcCulloch 2010) for the experiments. It is 



interesting to see how RGF empirically compares to BART, as they share some high-level strategies such 
as explicit regularization and non-black-box approaches to tree learners. BART starts with m trees each 
of which consists of the root node only, and in each iteration, the m trees are either grown or pruned. 
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;ure 13: Comparison with Random forests. Average of 10 or 3 runs. Each run used 2K training data points. 
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Figure 14: Comparison with BART. Average of 10 or 3 runs. Each run used 2K training data points. 
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This process is viewed as generating a sequence of additive models of m trees, which is converging to the 
posterior distribution on the "true" model. Prediction is done by taking the average of the predictions of K 
models from K iterations after some number of burn-in iterations. Therefore, the final model consists of 
m x K trees. 

In addition to m and K above, BART has five parameters to control regularization. We did not attempt 
to tune all the parameters as [Chipman et al. 201 Of reports that the default/recommended values work well, 
and as BART training is relatively time-consuming (roughly 100 times longer than GBDT according to 
[Chipma n et al.| 2010 1). After preliminary experiments, we found that performance can often be improved 
by appropriately choosing the weight shrinkage parameter k from {1,2,3} while the others fixed to the 
default values. Below we report the results of the best BART configuration (at its peak) among these three 
configurations. 

Figure 14 shows performance in relation to the number of trees in the final model (in the log-scale), 
which is m x K for BART where m = 200 (default) and K (the number of the additive models to be 
averaged) varies from 1 to 1000. Unlike previous figures, the model size is represented by the number of 
trees instead of leaf nodes since the BART package does not provide the number of leaf nodes. The black 
lines represent the best BART configuration, and the blue solid lines are RGF-L2 with square loss. The 
CT slices results are not shown as BART could not complete the training due to memory shortage. The 
BART performance is highest with 200,000 trees (the tested maximum) and relatively low with a small 
number of trees. On most of the tested datasets, at least one of the RGF configurations exceeds the BART's 
best performance with 100 to 1000 trees. Although BART and RGF share some high-level strategies, an 
interesting difference is that BART does not attempt to keep models compact. As a result, in our experiments, 
RGF required roughly 100 times fewer trees to achieve performance either better or comparable accuracy 
compared with BART. This means that prediction using the models obtained by RGF could be 100 times 
faster than prediction using the models obtained by BART, which may be significant in some applications. 



6.4 Running time 

We have shown that RGF often achieves higher accuracy than GBDT, but this is done at the cost of ad- 
ditional computational complexity mainly for fully-corrective weight updates. In this section we analyze 
running time in terms of the following factors: I, the number of leaf nodes generated during training; d, 
dimensionality of the original input space; n, the number of training data points; c, how many times the 
fully-corrective weight optimization is done; and z, the number of leaf nodes in one tree, or tree size. In 
RGF, tree size depends on the characteristics of data and strength of regularization. Although tree size can 
differ from tree to tree, for simplicity we treat it as one quantity, which should be approximated by the 
average tree size in applications. 

In typical tree ensemble learning implementation, for efficiency, the data points are sorted according to 
feature values at the beginning of training. The following analysis assumes that this "pre-sorting" has been 
done. Pre-sorting runs in 0(ndlog(n)), but its actual running time seems practically negligible compared 
with the other part of training even when n is as large as 100,000. 

Recall that RGF training consists of two major parts: one grows the forest, and the other optimizes/corrects 
the weights of leaf nodes. As in our experiments, assume that only the most recently-grown tree is searched 
during forest building and weight optimization is done by coordinate descent with a fixed number of itera- 
tions. First, we consider running time excluding the processing for regularization as it varies with the type 
of regularizer. The part to grow the forest runs in 0(nd£), same as GBDT. Weight optimization takes place c 
times, and each time we have an optimization problem of n data points each of which has at most - nonzero 
entries. Therefore, the running time for optimization, excluding regularization, is 0( n j^) using coordinate 
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Table 3: Average Training Time in Seconds. 2000 data points. RGF: average over 3 configurations of RGF-L 2 - 
GBDT: average over 9 best-performing configurations. Square loss. 
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Figure 15: Training Time of RGF. The circles and triangles are RGF with min-penalty regularizers with and 
without the zero-to-sum sibling constraints, respectively. 



descent implemented with sparse matrix representation. 

During forest building, the partial derivatives and the reduction of regularization penalty are referred to 
0(nd£) times. During weight optimization, the partial derivatives of the penalty are required O(ic) times. 
With RGF-L2, computation of these quantities with respect to one leaf weight is contained in the node of 
interest and runs in O(l). The extra running time for that is practically negligible. Computation of min- 
penalty regularizers involves O(z) nodes; however, with efficient implementation that stores and reuses 
invariant quantities, extra running time for min-penalty regularizers during forest building can be reduced 
to 0(nd£) + 0(£z 2 ) (instead of 0(nd£z), which could be large with a large amount of training data). The 
extra running time during weight optimization is 0(£cz), and the constant part can be substantially reduced 
by efficient implementation. Details are shown in the Appendix. 



Empirical running time In Table[3j we show the elapsed time of RGF on the tested datasets in comparison 
with GBDT. The RGF column shows the average over three RGF-L2 configurations with A G {1,0.1, 0.01}. 
The GBDT column shows the average over the best-performing configurations, which are 9 combinations 
of tree size and the shrinkage parameter that achieved the best performance at least once in Figures 6 - 9 
For both, the loss function was square loss; £ was set to the maximum value of the x-axis in Figures 6-9 
and n was set to 2000 as before. RGF training mostly took 2-5 times longer than GBDT, as shown in the 
right-most column. 

Figure[l5]plots running time in relation to tree size z which is approximated by the average tree size. The 
points X's are RGF-L2 with A set to practically useful values {1, 0.1, 0.01}. The points +'s are RGF-L2 with 
much larger A, which would not be used in practical applications but are shown for running time analysis. 
The triangles and circles are with min-penalty regularizers with A G — , ^^} with 7 G {2, 4}. On each 
dataset, n, £, and c were fixed to the same values as in Figure [3] On the 4-leaf synthesized dataset, on which 
d is relatively small, weight optimization in O(^) dominates over forest building in 0(nd£), so the running 
time is nearly inversely proportional to z. The additional running time for min-penalty regularization seems 
negligible with small tree sizes z and more prominent with larger z. By contrast, on CT slices, whose d is 
relatively large, forest building in 0(nd£) dominates over weight optimization in O(^), so the influence 
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of the tree size z on RGF-L2 is small. 

7 Conclusion 

This paper introduced a new method that learns a nonlinear function by using an additive model over non- 
linear decision rules. Unlike the traditional boosted decision tree approach, the proposed method directly 
works with the underlying forest structure. The resulting method, which we refer to as regularized greedy 
forest (RGF), integrates two ideas: one is to include tree- structured regularization into the learning formu- 
lation; and the other is to employ the fully-corrective regularized greedy algorithm. Since in this approach 
we are able to take advantage of the special structure of the decision forest, the resulting learning method 
is effective and principled. Our empirical studies showed that the new method can achieve more accurate 
predictions with smaller forests than the tested existing methods, especially for more complex nonlinear 
functions that require stronger regularization. 
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A Appendix 



A. 1 Convergence of Algorithm [4] 

To show that Algorithm[4]converges, let us express the algorithm using matrix multiplication as follows. Let 
J be the number of internal nodes, and define a matrix A G ~R JxJ and a column vector b G M J so that: 

TT2^ w=p{v) 
A\v,w] = ( p{w) = v , b[v}= ^ 







v 

otherwise 



p(w)=v ,w£Lt 



_JOw_ 

l + 2 7 



Define B G R( J + 1 ) X ( J + 1 ) and j3 G M n+1 so that: 



Then since we have: 



B 



B' 



A b 

1 







E 



m—l 
i=0 



A 2 b 



in 



iterations of Algorithm|4]is equivalent to: 



a v v G Lt 

(B m ^) H = (YZ "o 1 A J b) H „ £ L T 



It is well known that for any square matrix Z, if ||Z|| P < 1 for some p > 1 then I — Z is invertible and 
(I - Z)- 1 = J2T=o zk - Therefore, it suffices to show that || A|| p < 1 for some p > 1. 

First, consider the case that 7 = 1. In this case, A is symmetric and column v (and row v) has |iV(u)| 
non-zero entries, where N(v) denote the set of the internal nodes adjacent to v, and all the non-zero entries 
are 1/3. 

Using the fact that \N(v)\ < 3 for v / ot and \N(ot)\ < 2, for any x G M J , we have: 

2 



I Ax||! 



j 




sE E 



2x[fc] • x[£] 



j k,£&N(j),k<£ 



> 



5? E 



(x[fc] -x[£]) 2 > 



j k,££N(j),k<£ 

Therefore, ||A||2 < 1. 

Next suppose that 7 > 1. Then we have: 

J 1 
||A||i = max V \A\i, j]\ < 2 1 — < 1 



i=l 



Hence, Algorithm |4] converges with 7 > 1. 

Another way to look at this is that G Lt- [fiv = &v\ in ( |Toj ) and (13) can be expressed using the 
matrix notation above as: J} = A/3 + b. As shown above, I — A is invertible, and therefore {j3 v } with 
desired properties can be obtained by (3 = (I — A) _1 b. Algorithm [5] computes it iteratively. 

Our implementation used its slight variant (Algorithm [6]) as it converges faster. 
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Algorithm 6: 



forall the v G T do /3 V < 
for i = 1 to m do 



a v v <E Lt 
v ^ L T 



for u ^ in some fixed order do /3 V <^ < a* 



1 + 

w) — v i 



V ^ Ot 

V = Ot 



end 



A.2 Computational detail of the min-penalty regularization in Section 5.4.2 



To optimize weights according to ([8]>, we need to obtain the derivatives of the regularization penalty, 

dg(T(6 u )) an( j 8 S(T{Su)) where 5 V is the additive change to a„ , the weight of a leaf node u, 

dS u \5 U =0 dK \S u =o" u 6 u _ 6 

and T is the tree to which node u belongs. Let {p v } = arg minrg i {/({/?«}) : G £t-[A = «d]} so 



that (T) = A • f({pv}), where / and G(T) are defined in terms of auxiliary variables as in ( 1 1 1 and ( |T2| ) 

From the derivation in the previous section, we know that p w is linear in leaf weights {«„}. In particular, 
p w for an internal node w can be written in the form of p w = ^2 v£Lt c W:V a v with coefficients c w ^ v that are 
independent of leaf weights and only depend on the tree topology. Also considering p v = a v for v G Lt, 
we have: 

Cw,u w <£ L T 
1 w = u 
w 7^ u & w G Lt 

c WtU can be obtained by Algorithm [4] (or [6]) with input of: a u = 1; at = for t ^ u; and the topological 



structure of T. Also using = 0, it is straightforward to derive from the definition of / in (112} that: 



dQ(T(5 u )) =xT d w ^ d*g(T(5 u )) =xTj d w (dpA 2 (15) 

where p w = p w — p v i w ) ^ w °T] Pw otherwise. (16) 

The optimum change to the leaf weight a u can be computed using these quantities. Loss reduction caused 
by node split can be similarly estimated. 

For efficient implementation The key to efficient implementation is to make use of the fact that in the 
course of training certain quantities are locally invariant, by storing and reusing the invariant quantities. 
First, since the iterative algorithm is relatively complex, it should be executed as infrequently as possible. 
As noted above, p w 's partial derivatives only depend on the topological structure of the tree, so they need to 
be computed only when the tree topology changes. When 5 V is added to a v , p w should be updated through 
pw^Pw + Sv^ 1 instead of running the iterative algorithm. Second, consider the process of evaluating 
loss reduction of all possible splits of some node, which is fixed during this process. Using the notation in 



Section 6.4 the partial derivatives of the regularization penalty similar to ( 15 ) are referred to 0{nd) times 
in this process, but they are invariant and need to be computed just once. The change in penalty caused by 
node split is evaluated also 0(nd) times, and one can make each evaluation run in 0(1) instead of 0(z) by 



storing invariant quantities. To see this, as in Section 5.3.1 consider splitting a node associated with weight 
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a, and let for k = 1, 2 be the new leaf nodes after split with weight a + 5k- Write T(S\, 62) for the new 
tree. Define p w and p w as above but on T(0, 0) instead of T. To simplify notation, let p w ± = gJ^" ■ Due 

to symmetry, we have: p W; % = p w ^ for to / ui & u; / «2 • The change in penalty is: Q{T(5\, 82)) — 
g(T) = (g(f '(61,82)) -Q(f '(0,0) 1)) + (C/(T (0, 0)) - <7(T)V Since the second term is invariant during 
this process, it needs to be computed only once. To ease notation, let 



2 1 / 2 y 

">v E 6kPv,k + 2 I E Sk P v , k ) 

fc=l \fc=l / 



2 



Then the first term in the penalty change can be written as: 

(g(f(8 1 ,8 2 )) - g(f(o, o))) = 2 E ^ (a. + E ) - J E ^pl = E a - 

=0*1 + 62) E 7*^,1 + + 52)2 + A «i + A «2 • 

vET— {111,1*2} t)GT— {1*1,1*2} 

Here X) v6E t -{1*1,1*2} 1 dv PuPvA and Z)t,ef-{ui,«2} T^P^i are mvarian t during this process and can be pre- 
computed; therefore, each evaluation of penalty differences runs in O(l). 



A.3 Computational detail of the regularizer in Section 5.4.3 



The same basic ideas above can be applied to make efficient implementation of the regularizer with sum-to- 
zero sibling constraints. To optimize the additive change 5 U of the leaf weight a u , let {p v } be the arguments 
that minimize (|9| so that Q(T) = A • Y^veT 7 d "P 2 / 2 - Then the partial derivatives of Q(T(6 U )) are obtained 



by < \15\ . From ( |14| ), we have: 

dp 



2 du k W = OT 

ndw—d,,,,—! 



da u 



w 7^ ot, w G A(u) (w is either u or iTs ancestor) 
—2 dw ~ du k~ l w £ A(u),p(w) G A(u) (w is u's ancestor's sibling) 
otherwise 



Optimization can be done using these quantities. 

Regarding efficient implementation, one difference from the regularizer without the sibling constraints 
above is that Q(T(0, 0)) = Q(T) with this regularizer and therefore Q(T(0, 0)) does not have to be com- 
puted. 
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